Today we will…
tidyr
dplyr

“For each demographic group listed below, determine all words in this study that were the most and least familiar, on average.”
Tidy data…
R4DS
Illustration by Allison Horst
We may need to transform our data into the version of tidy that best suits the task at hand.
Illustration by Allison Horst
We want to look at mean cereal nutrients based on shelf.
| name | manuf | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100% Bran | N | cold | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1 | 0.33 | 68.40297 |
| 100% Natural Bran | Q | cold | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1 | 1.00 | 33.98368 |
| All-Bran | K | cold | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1 | 0.33 | 59.42551 |
| All-Bran with Extra Fiber | K | cold | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1 | 0.50 | 93.70491 |
| Almond Delight | R | cold | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1 | 0.75 | 34.38484 |
| Apple Cinnamon Cheerios | G | cold | 110 | 2 | 2 | 180 | 1.5 | 10.5 | 10 | 70 | 25 | 1 | 1 | 0.75 | 29.50954 |
| shelf | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 102.5000 | 2.650000 | 0.60 | 176.2500 | 1.6850000 | 15.80000 | 4.800000 | 75.50000 | 20.00000 |
| 2 | 109.5238 | 1.904762 | 1.00 | 145.7143 | 0.9047619 | 13.61905 | 9.619048 | 57.80952 | 23.80952 |
| 3 | 107.7778 | 2.861111 | 1.25 | 158.6111 | 3.1388889 | 14.50000 | 6.527778 | 129.83333 | 35.41667 |
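A summary table like the one above can be built with `group_by()` and `summarize(across())`. A minimal sketch, using a toy stand-in for the cereal data (the real data has more columns and rows):

```r
library(dplyr)

# Toy stand-in for the cereal data (only a few columns shown)
cereal <- data.frame(
  shelf    = c(3, 3, 1),
  calories = c(70, 120, 110),
  protein  = c(4, 3, 2)
)

# Mean of each nutrient column, computed within each shelf
cereal_wide <- cereal |>
  group_by(shelf) |>
  summarize(across(calories:protein, mean))
cereal_wide
```

`across()` applies the same summary function to every column in the selection, so adding more nutrient columns requires no extra code.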
my_colors <- c("calories_col" = "steelblue", "sugars_col" = "orange3")
cereal_wide |>
  ggplot() +
  geom_point(aes(x = shelf, y = calories, color = "calories_col")) +
  geom_line(aes(x = shelf, y = calories, color = "calories_col")) +
  geom_point(aes(x = shelf, y = sugars, color = "sugars_col")) +
  geom_line(aes(x = shelf, y = sugars, color = "sugars_col")) +
  scale_color_manual(values = my_colors, labels = names(my_colors)) +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount", color = "Nutrient")

| shelf | Nutrient | mean_amount |
|---|---|---|
| 1 | calories | 102.5000000 |
| 1 | carbo | 15.8000000 |
| 1 | fat | 0.6000000 |
| 1 | fiber | 1.6850000 |
| 1 | potass | 75.5000000 |
| 1 | protein | 2.6500000 |
| 1 | sodium | 176.2500000 |
| 1 | sugars | 4.8000000 |
| 1 | vitamins | 20.0000000 |
| 2 | calories | 109.5238095 |
| 2 | carbo | 13.6190476 |
| 2 | fat | 1.0000000 |
| 2 | fiber | 0.9047619 |
| 2 | potass | 57.8095238 |
| 2 | protein | 1.9047619 |
| 2 | sodium | 145.7142857 |
| 2 | sugars | 9.6190476 |
| 2 | vitamins | 23.8095238 |
| 3 | calories | 107.7777778 |
| 3 | carbo | 14.5000000 |
| 3 | fat | 1.2500000 |
| 3 | fiber | 3.1388889 |
| 3 | potass | 129.8333333 |
| 3 | protein | 2.8611111 |
| 3 | sodium | 158.6111111 |
| 3 | sugars | 6.5277778 |
| 3 | vitamins | 35.4166667 |
Consider daily rainfall observed in SLO in January 2023.
How would you manually convert this to long format?
Create a new column: Day_of_Week.
Create a new column: Rainfall (hold daily rainfall values).
Now we have three columns (Week, Day_of_Week, and Rainfall) – start moving Sunday values over.
Duplicate Week 1-5 and copy Monday values over.
Duplicate Week 1-5 and copy Tuesday values over.
Continue for the rest of the days of the week.
You may want to arrange() by Week to get the rainfall values in chronological order.
We can use pivot_longer() to turn a wide dataset into a long(er) dataset.
pivot_longer()

Take a wide dataset and turn it into a long dataset.

cols – specify the columns that should be pivoted.
names_to – the name of the new column containing the old column names.
values_to – the name of the new column containing the old column values.

pivot_longer()

| Week | Day_of_Week | Daily_Rainfall |
|---|---|---|
| 1 | Sunday | 0.00 |
| 1 | Monday | 0.12 |
| 1 | Tuesday | 0.00 |
| 1 | Wednesday | 1.58 |
| 1 | Thursday | 0.91 |
| 1 | Friday | 0.00 |
| 1 | Saturday | 0.05 |
| 2 | Sunday | 0.27 |
| 2 | Monday | 4.26 |
| 2 | Tuesday | 0.43 |
| 2 | Wednesday | 0.00 |
| 2 | Thursday | 0.00 |
| 2 | Friday | 0.16 |
| 2 | Saturday | 1.41 |
| 3 | Sunday | 0.34 |
| 3 | Monday | 0.33 |
| 3 | Tuesday | 0.00 |
| 3 | Wednesday | 0.00 |
| 3 | Thursday | 0.13 |
| 3 | Friday | 0.00 |
| 3 | Saturday | 0.00 |
| 4 | Sunday | 0.00 |
| 4 | Monday | 0.00 |
| 4 | Tuesday | 0.00 |
| 4 | Wednesday | 0.00 |
| 4 | Thursday | 0.00 |
| 4 | Friday | 0.00 |
| 4 | Saturday | NA |
| 5 | Sunday | NA |
| 5 | Monday | NA |
| 5 | Tuesday | NA |
| 5 | Wednesday | NA |
| 5 | Thursday | NA |
| 5 | Friday | NA |
| 5 | Saturday | NA |
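All of the manual steps above collapse into one `pivot_longer()` call. A minimal sketch, using a toy version of the wide rainfall data (the column name `rainfall_wide` is an assumption, not from the original slides):

```r
library(tidyr)

# Toy wide rainfall data: one row per week, one column per day
rainfall_wide <- data.frame(
  Week    = 1:2,
  Sunday  = c(0.00, 0.27),
  Monday  = c(0.12, 4.26),
  Tuesday = c(0.00, 0.43)
)

# Pivot the day columns into a Day_of_Week / Daily_Rainfall pair
rainfall_long <- rainfall_wide |>
  pivot_longer(cols      = Sunday:Tuesday,
               names_to  = "Day_of_Week",
               values_to = "Daily_Rainfall")
rainfall_long
```

Because each input row pivots into one row per selected column, the result is already arranged by Week, then day.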
pivot_wider()

Take a long dataset and turn it into a wide dataset.

id_cols – specify the column(s) that contain the ID for unique rows in the wide dataset.
names_from – the name of the column containing the new column names.
values_from – the name of the column containing the new column values.

pivot_wider()

We calculate the mean amount of protein for cereals on each shelf and for each manufacturer.
| manuf | shelf | mean_protein |
|---|---|---|
| A | 2 | 4.000000 |
| G | 1 | 3.000000 |
| G | 2 | 1.285714 |
| G | 3 | 2.666667 |
| K | 1 | 2.750000 |
| K | 2 | 2.142857 |
| K | 3 | 2.916667 |
| N | 1 | 2.666667 |
| N | 2 | 2.500000 |
| N | 3 | 4.000000 |
| P | 1 | 1.500000 |
| P | 2 | 1.000000 |
| P | 3 | 3.000000 |
| Q | 1 | 5.000000 |
| Q | 2 | 2.000000 |
| Q | 3 | 2.500000 |
| R | 1 | 2.000000 |
| R | 3 | 3.000000 |
pivot_wider()

| manuf | 1 | 2 | 3 |
|---|---|---|---|
| G | 3.000000 | 1.285714 | 2.666667 |
| K | 2.750000 | 2.142857 | 2.916667 |
| N | 2.666667 | 2.500000 | 4.000000 |
| P | 1.500000 | 1.000000 | 3.000000 |
| Q | 5.000000 | 2.000000 | 2.500000 |
| R | 2.000000 | NA | 3.000000 |
| A | NA | 4.000000 | NA |
pivot_wider()

| manuf | Shelf_1 | Shelf_2 | Shelf_3 |
|---|---|---|---|
| G | 3.000000 | 1.285714 | 2.666667 |
| K | 2.750000 | 2.142857 | 2.916667 |
| N | 2.666667 | 2.500000 | 4.000000 |
| P | 1.500000 | 1.000000 | 3.000000 |
| Q | 5.000000 | 2.000000 | 2.500000 |
| R | 2.000000 | NA | 3.000000 |
| A | NA | 4.000000 | NA |
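The `Shelf_` column names in the second table come from pivot_wider()'s `names_prefix` argument. A minimal sketch, using a toy version of the protein summary (the name `mean_protein_data` is an assumption):

```r
library(tidyr)

# Toy long-format summary: mean protein by manufacturer and shelf
mean_protein_data <- data.frame(
  manuf        = c("G", "G", "K"),
  shelf        = c(1, 2, 1),
  mean_protein = c(3.000000, 1.285714, 2.750000)
)

# One row per manufacturer, one column per shelf;
# names_prefix turns "1" into "Shelf_1", etc.
mean_protein_data |>
  pivot_wider(id_cols      = manuf,
              names_from   = shelf,
              values_from  = mean_protein,
              names_prefix = "Shelf_")
```

Combinations with no rows in the long data (like manufacturer R on shelf 2) become NA in the wide result.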
Multiple, interconnected tables of data are called relational.
IMDb movie relational data
We can combine (join) data tables based on their relations.
Mutating joins
Add variables from a new dataframe to observations in an existing dataframe.
full_join(), left_join(), right_join(), inner_join() (there is no outer_join() in dplyr; full_join() is its outer join)
Filtering Joins
Filter observations based on values in new dataframe.
semi_join(), anti_join()
A key uniquely identifies an observation in a data set.
inner_join()

Keeps observations when their keys are present in both datasets.
inner_join(): IMDb Example

| director_id | genre | prob |
|---|---|---|
| 429 | Adventure | 0.750000 |
| 429 | Fantasy | 0.750000 |
| 2931 | Drama | 0.714286 |
| 2931 | Action | 0.428571 |
| 11652 | Sci-Fi | 0.500000 |
| 11652 | Action | 0.500000 |
| 14927 | Animation | 1.000000 |
| 14927 | Family | 1.000000 |
| 15092 | Comedy | 0.545455 |
| 15092 | Crime | 0.545455 |
ID: 429, 2931, 11652, 14927, 15092
ID: 429, 9247, 11652, 14927, 15092
| director_id | genre | prob | movie_id |
|---|---|---|---|
| 429 | Adventure | 0.750000 | 300229 |
| 429 | Fantasy | 0.750000 | 300229 |
| 11652 | Sci-Fi | 0.500000 | 10920 |
| 11652 | Sci-Fi | 0.500000 | 333856 |
| 11652 | Action | 0.500000 | 10920 |
| 11652 | Action | 0.500000 | 333856 |
| 14927 | Animation | 1.000000 | 192017 |
| 14927 | Family | 1.000000 | 192017 |
| 15092 | Comedy | 0.545455 | 109093 |
| 15092 | Comedy | 0.545455 | 237431 |
| 15092 | Crime | 0.545455 | 109093 |
| 15092 | Crime | 0.545455 | 237431 |
ID: 429, 2931, 9247, 11652, 14927, 15092
inner_join(): IMDb Example

What if our key does not have the same name?
| director_id | genre | prob |
|---|---|---|
| 429 | Adventure | 0.750000 |
| 429 | Fantasy | 0.750000 |
| 2931 | Drama | 0.714286 |
| 2931 | Action | 0.428571 |
| 11652 | Sci-Fi | 0.500000 |
| 11652 | Action | 0.500000 |
| 14927 | Animation | 1.000000 |
| 14927 | Family | 1.000000 |
| 15092 | Comedy | 0.545455 |
| 15092 | Crime | 0.545455 |
| id | first_name | last_name | genre | prob |
|---|---|---|---|---|
| 429 | Andrew | Adamson | Adventure | 0.750000 |
| 429 | Andrew | Adamson | Fantasy | 0.750000 |
| 11652 | James (I) | Cameron | Sci-Fi | 0.500000 |
| 11652 | James (I) | Cameron | Action | 0.500000 |
| 14927 | Ron | Clements | Animation | 1.000000 |
| 14927 | Ron | Clements | Family | 1.000000 |
| 15092 | Ethan | Coen | Comedy | 0.545455 |
| 15092 | Ethan | Coen | Crime | 0.545455 |
left_join() – keep only (and all) observations in the left data set
right_join() – keep only (and all) observations in the right data set
full_join() – keep all observations in both data sets
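A minimal sketch of the difference, with two hypothetical toy tables sharing the key `id`:

```r
library(dplyr)

left_tbl  <- data.frame(id = c(1, 2), x = c("a", "b"))
right_tbl <- data.frame(id = c(2, 3), y = c("B", "C"))

left_join(left_tbl, right_tbl, by = "id")   # keeps ids 1, 2
right_join(left_tbl, right_tbl, by = "id")  # keeps ids 2, 3
full_join(left_tbl, right_tbl, by = "id")   # keeps ids 1, 2, 3
```

Keys missing from one side produce NA in the columns that come from that side.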
Which directors would remain for each of the following?
left_join(directors_genres, movies_directors)
right_join(directors_genres, movies_directors)
full_join(directors_genres, movies_directors)

semi_join()

Keeps observations when their keys are present in both datasets, but only keeps variables from the first dataset.
semi_join()

| director_id | genre | prob |
|---|---|---|
| 429 | Adventure | 0.750000 |
| 429 | Fantasy | 0.750000 |
| 11652 | Sci-Fi | 0.500000 |
| 11652 | Action | 0.500000 |
| 14927 | Animation | 1.000000 |
| 14927 | Family | 1.000000 |
| 15092 | Comedy | 0.545455 |
| 15092 | Crime | 0.545455 |
Movie Directors: 429, 2931, 11652, 14927, 15092
anti_join()

Removes observations when their keys are present in both datasets, and only keeps variables from the first dataset.
anti_join()

Remember: the dataset you pipe in becomes the first argument of the function you are piping into!
…is equivalent to…
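In code, using the built-in mtcars data:

```r
# Piping a data frame into a function...
mtcars |> head(2)

# ...is equivalent to passing it as the first argument:
head(mtcars, 2)
```

This is why `directors_genres |> inner_join(movies_directors)` and `inner_join(directors_genres, movies_directors)` give the same result.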
Today you will be tidying messy data to explore the relationship between countries of the world and military spending.
For this challenge, you will be using table joins to solve a murder mystery!
Today we will…
dplyr package updates
library(janitor)

The clean_names() function converts all names of variables in a data set to snake_case.
[1] "Country" "Notes" "Reporting year" "1988"
[5] "1989" "1990" "1991" "1992"
[9] "1993" "1994" "1995" "1996"
[13] "1997" "1998" "1999" "2000"
[17] "2001" "2002" "2003" "2004"
[21] "2005" "2006" "2007" "2008"
[25] "2009" "2010" "2011" "2012"
[29] "2013" "2014" "2015" "2016"
[33] "2017" "2018" "2019"
[1] "country" "notes" "reporting_year" "x1988"
[5] "x1989" "x1990" "x1991" "x1992"
[9] "x1993" "x1994" "x1995" "x1996"
[13] "x1997" "x1998" "x1999" "x2000"
[17] "x2001" "x2002" "x2003" "x2004"
[21] "x2005" "x2006" "x2007" "x2008"
[25] "x2009" "x2010" "x2011" "x2012"
[29] "x2013" "x2014" "x2015" "x2016"
[33] "x2017" "x2018" "x2019"
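A minimal sketch of what produces the output above, on a toy data frame with the same kinds of names:

```r
library(janitor)

# Toy data frame with a space in one name and a name that
# starts with a digit, like the military-spending data
df <- data.frame(check.names = FALSE,
                 `Reporting year` = 2019,
                 `1988`           = 1.5)

clean_names(df) |> names()
# "reporting_year" "x1988"
```

Names that would start with a digit get an `x` prefix, since R variable names cannot begin with a number.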
Learn more about the lifecycle stages of packages, functions, and function arguments in R.
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
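A minimal sketch of the replacement the warning suggests, on a hypothetical toy data frame:

```r
library(dplyr)

df <- data.frame(a = c(1, NA), b = c(NA, NA))

# Deprecated in dplyr 1.0.8:
#   df |> filter(across(everything(), is.na))

# Preferred: keep rows where every column is NA
df |> filter(if_all(everything(), is.na))
```

`if_any()` instead keeps rows where the condition holds for at least one of the selected columns.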
# A tibble: 3 × 35
Country Notes `Reporting year` `1988` `1989` `1990` `1991` `1992` `1993`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Africa <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 North Africa <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 Sub-Saharan <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# ℹ 26 more variables: `1994` <chr>, `1995` <chr>, `1996` <chr>, `1997` <chr>,
# `1998` <chr>, `1999` <chr>, `2000` <chr>, `2001` <chr>, `2002` <chr>,
# `2003` <chr>, `2004` <chr>, `2005` <chr>, `2006` <chr>, `2007` <chr>,
# `2008` <chr>, `2009` <chr>, `2010` <chr>, `2011` <chr>, `2012` <chr>,
# `2013` <chr>, `2014` <chr>, `2015` <chr>, `2016` <chr>, `2017` <chr>,
# `2018` <chr>, `2019` <chr>
# A tibble: 3 × 35
Country Notes `Reporting year` `1988` `1989` `1990` `1991` `1992` `1993`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Africa <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 North Africa <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 Sub-Saharan <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# ℹ 26 more variables: `1994` <chr>, `1995` <chr>, `1996` <chr>, `1997` <chr>,
# `1998` <chr>, `1999` <chr>, `2000` <chr>, `2001` <chr>, `2002` <chr>,
# `2003` <chr>, `2004` <chr>, `2005` <chr>, `2006` <chr>, `2007` <chr>,
# `2008` <chr>, `2009` <chr>, `2010` <chr>, `2011` <chr>, `2012` <chr>,
# `2013` <chr>, `2014` <chr>, `2015` <chr>, `2016` <chr>, `2017` <chr>,
# `2018` <chr>, `2019` <chr>
dplyr updates

There was an update to the dplyr package to Version 1.1.0 on Sunday 1/29/2023.
group_by() vs the .by = argument
across()
join_by()

Multiple tables of data are called relational data because it is the relations, not just the individual data sets, that are important.
Keys:
# A tibble: 34 × 3
id first_name last_name
<dbl> <chr> <chr>
1 429 Andrew Adamson
2 2931 Darren Aronofsky
3 9247 Zach Braff
4 11652 James (I) Cameron
5 14927 Ron Clements
6 15092 Ethan Coen
7 15093 Joel Coen
8 15901 Francis Ford Coppola
9 15906 Sofia Coppola
10 16816 Cameron Crowe
# ℹ 24 more rows
# A tibble: 36 × 4
id name year rank
<dbl> <chr> <dbl> <dbl>
1 10920 Aliens 1986 8.20
2 17173 Animal House 1978 7.5
3 18979 Apollo 13 1995 7.5
4 30959 Batman Begins 2005 NA
5 46169 Braveheart 1995 8.30
6 109093 Fargo 1996 8.20
7 111813 Few Good Men, A 1992 7.5
8 112290 Fight Club 1999 8.5
9 116907 Footloose 1984 5.80
10 124110 Garden State 2004 8.30
# ℹ 26 more rows
join_one_data <- movies_directors |>
inner_join(directors,
by = c("director_id" = "id")
)
join_one_data

# A tibble: 41 × 4
director_id movie_id first_name last_name
<dbl> <dbl> <chr> <chr>
1 429 300229 Andrew Adamson
2 2931 254943 Darren Aronofsky
3 9247 124110 Zach Braff
4 11652 10920 James (I) Cameron
5 11652 333856 James (I) Cameron
6 14927 192017 Ron Clements
7 15092 109093 Ethan Coen
8 15092 237431 Ethan Coen
9 15093 109093 Joel Coen
10 15093 237431 Joel Coen
# ℹ 31 more rows
If you update dplyr…
join_two_data <- movies_directors |>
inner_join(directors,
by = c("director_id" = "id")
) |>
inner_join(movies,
by = c("movie_id" = "id")
) |>
rename(movie_name = name)
join_two_data

# A tibble: 41 × 7
director_id movie_id first_name last_name movie_name year rank
<dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 429 300229 Andrew Adamson Shrek 2001 8.10
2 2931 254943 Darren Aronofsky Pi 1998 7.5
3 9247 124110 Zach Braff Garden State 2004 8.30
4 11652 10920 James (I) Cameron Aliens 1986 8.20
5 11652 333856 James (I) Cameron Titanic 1997 6.90
6 14927 192017 Ron Clements Little Mermaid, The 1989 7.30
7 15092 109093 Ethan Coen Fargo 1996 8.20
8 15092 237431 Ethan Coen O Brother, Where Art T… 2000 7.80
9 15093 109093 Joel Coen Fargo 1996 8.20
10 15093 237431 Joel Coen O Brother, Where Art T… 2000 7.80
# ℹ 31 more rows
Using the hiphop data from Lab 3…
# A tibble: 10,752 × 6
word subj folk rock country pop
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 [to be] ghost p15 0 1 3 4
2 [to be] ghost p53 0 0 1 3
3 [to be] ghost p136 0 0 0 0
4 [to be] ghost p36 1 1 2 2
5 [to be] ghost p118 0 1 0 3
6 [to be] ghost p106 0 0 1 1
7 [to be] ghost p70 2 0 0 0
8 [to be] ghost p107 0 1 3 1
9 [to be] ghost p102 0 0 0 0
10 [to be] ghost p83 0 4 4 4
# ℹ 10,742 more rows
# A tibble: 10,752 × 3
word participant familiarity
<chr> <chr> <dbl>
1 [to be] ghost p15 1
2 [to be] ghost p53 1
3 [to be] ghost p136 1
4 [to be] ghost p36 1
5 [to be] ghost p118 1
6 [to be] ghost p106 1
7 [to be] ghost p70 1
8 [to be] ghost p107 1
9 [to be] ghost p102 5
10 [to be] ghost p83 1
# ℹ 10,742 more rows
music_wordfam <- music |>
  full_join(familiarity,
            by = c("word" = "word", "subj" = "participant")
  )
music_wordfam
# A tibble: 10,752 × 7
word subj folk rock country pop familiarity
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 [to be] ghost p15 0 1 3 4 1
2 [to be] ghost p53 0 0 1 3 1
3 [to be] ghost p136 0 0 0 0 1
4 [to be] ghost p36 1 1 2 2 1
5 [to be] ghost p118 0 1 0 3 1
6 [to be] ghost p106 0 0 1 1 1
7 [to be] ghost p70 2 0 0 0 1
8 [to be] ghost p107 0 1 3 1 1
9 [to be] ghost p102 0 0 0 0 5
10 [to be] ghost p83 0 4 4 4 1
# ℹ 10,742 more rows
Lab 4: Avocado Prices + Challenge 4: Avocado Toast Ate My Mortgage
Handy Helpers
rename() – Change names of columns
separate() – Separate values of a variable
Filtering Joins
semi_join(): Keeps values found in another data set
anti_join(): Keeps values not found in another data set